Reviewer's comment v1

Hello Yonas, this is Gerardo Flores 👋

You can find me on Discord as gerardo.flores.ds

I'm happy to review your project today 🙌

You can find my comments under the heading «Review». I will categorize my comments in green, blue or red boxes like this:

Success: if everything is done successfully
Remarks: if I can give some recommendations or ways to improve the project
Needs fixing: if the block requires some corrections. Work can't be accepted with the red comments

Please don't remove my comments. If you have any questions, don't hesitate to respond to my comments in a different section.

Student comments: For example like this
Reviewer's comment v1

General Feedback

You’ve submitted another project—great work! Your commitment to pushing through the challenges of this program is admirable.

After reviewing your submission, I believe it should be approved. You can find my more detailed feedback within your project file in the Reviewer's comment v1 section.

And of course, if you have any questions along the way, remember that you can always reach out to your tutor for any clarification.

Video Game Sales Analysis Project¶

Project Overview¶

In this project, you'll analyze video game sales data to identify patterns that determine a game's success. Working as an analyst for the online store Ice, you'll use this information to help plan future advertising campaigns.

Environment Setup and Required Libraries¶

Reviewer's comment v1

Great job adding additional information about the project goal. It would also be helpful to share the steps you plan to complete.

In [1]:
# Import all required libraries

import pandas as pd
import numpy as np
import math as mt
import plotly.express as px
from scipy import stats as st

Step 1: Loading and Initial Data Exploration¶

First, let's load our dataset and examine its basic properties:

In [2]:
# Load the dataset

data = pd.read_csv('/datasets/games.csv')
In [3]:
# Display basic information about the dataset

display(data)
print()
data.info()
print()
data.describe()
Name Platform Year_of_Release Genre NA_sales EU_sales JP_sales Other_sales Critic_Score User_Score Rating
0 Wii Sports Wii 2006.0 Sports 41.36 28.96 3.77 8.45 76.0 8 E
1 Super Mario Bros. NES 1985.0 Platform 29.08 3.58 6.81 0.77 NaN NaN NaN
2 Mario Kart Wii Wii 2008.0 Racing 15.68 12.76 3.79 3.29 82.0 8.3 E
3 Wii Sports Resort Wii 2009.0 Sports 15.61 10.93 3.28 2.95 80.0 8 E
4 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing 11.27 8.89 10.22 1.00 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
16710 Samurai Warriors: Sanada Maru PS3 2016.0 Action 0.00 0.00 0.01 0.00 NaN NaN NaN
16711 LMA Manager 2007 X360 2006.0 Sports 0.00 0.01 0.00 0.00 NaN NaN NaN
16712 Haitaka no Psychedelica PSV 2016.0 Adventure 0.00 0.00 0.01 0.00 NaN NaN NaN
16713 Spirits & Spells GBA 2003.0 Platform 0.01 0.00 0.00 0.00 NaN NaN NaN
16714 Winning Post 8 2016 PSV 2016.0 Simulation 0.00 0.00 0.01 0.00 NaN NaN NaN

16715 rows × 11 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB

Out[3]:
Year_of_Release NA_sales EU_sales JP_sales Other_sales Critic_Score
count 16446.000000 16715.000000 16715.000000 16715.000000 16715.000000 8137.000000
mean 2006.484616 0.263377 0.145060 0.077617 0.047342 68.967679
std 5.877050 0.813604 0.503339 0.308853 0.186731 13.938165
min 1980.000000 0.000000 0.000000 0.000000 0.000000 13.000000
25% 2003.000000 0.000000 0.000000 0.000000 0.000000 60.000000
50% 2007.000000 0.080000 0.020000 0.000000 0.010000 71.000000
75% 2010.000000 0.240000 0.110000 0.040000 0.030000 79.000000
max 2016.000000 41.360000 28.960000 10.220000 10.570000 98.000000

Since year_of_release is only missing 269 entries (1.6% of the dataset), and the year is an important metric in this analysis, I will substitute the median for the missing values in that column. When it comes time for score and rating analysis, I will instead omit entries with missing values, since a significant number are missing there and substituting could skew our results.
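The substitution plan above can be sketched on a toy frame (hypothetical values; the real column comes from games.csv):

```python
import pandas as pd

# Toy stand-in for the real frame; the project loads /datasets/games.csv
df = pd.DataFrame({'year_of_release': [1998.0, None, 2007.0, 2010.0]})

# Fill the few missing years with the column median, then cast to int
# since we only track whole years (no month/day component)
median_year = df['year_of_release'].median()
df['year_of_release'] = df['year_of_release'].fillna(median_year).astype(int)
```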

In [4]:
# Check for duplicate entries

data.duplicated().sum()

#display(data[data['User_Score'] == 'tbd'])
Out[4]:
0

Key Questions to Answer:¶

  • What's the total number of records in our dataset?
  • What data types are present in each column?
  • Are there any obvious issues with the data?
  • Do we see any immediate patterns or anomalies?

We have 16,715 entries across 11 columns. Most are float64 dtype, with the expected object dtype for Name, Platform, Genre, and Rating. We also have 'tbd' lurking within the user_score column, forcing it to object dtype as well.
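The 'tbd' issue can be confirmed with a quick count; a minimal sketch on hypothetical values:

```python
import pandas as pd

# Toy user_score column mimicking the mixed strings in the real dataset
scores = pd.Series(['8', 'tbd', '8.3', 'tbd', None])

# The 'tbd' strings are what force the column to object dtype
tbd_count = (scores == 'tbd').sum()
```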

Reviewer's comment v1

Well done!

Step 2: Data Preparation¶

2.1 Standardizing Column Names¶

In [5]:
# Convert column names to lowercase

data.columns = data.columns.str.lower()
In [6]:
# Verify the changes

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB

Column names successfully converted to lowercase.

2.2 Data Type Conversion¶

In [7]:
# Check current data types

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16446 non-null  float64
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
In [8]:
# Make changes to data types if necessary
# Describe the columns where the data types have been changed and why.

median_year = data['year_of_release'].median()  # 2007 for this dataset
data['year_of_release'] = data['year_of_release'].fillna(median_year)
data['year_of_release'] = data['year_of_release'].astype(int)
data.info()

#display(min(data['year_of_release']), max(data['year_of_release']))

#display(data)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16715 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       10014 non-null  object 
 10  rating           9949 non-null   object 
dtypes: float64(5), int64(1), object(5)
memory usage: 1.4+ MB

I've decided to fill the missing year values with the median (2007) so the column can be converted to int, since we're only looking at the year, not the month/day. The resulting range is 1980 - 2016.

In [9]:
# Pay attention to the abbreviation TBD (to be determined). Specify how you intend to handle such cases.

data['user_score'] = data['user_score'].replace('tbd', np.nan)
data['user_score'] = data['user_score'].astype(float)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16715 non-null  object 
 2   year_of_release  16715 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16715 non-null  float64
 5   eu_sales         16715 non-null  float64
 6   jp_sales         16715 non-null  float64
 7   other_sales      16715 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       7590 non-null   float64
 10  rating           9949 non-null   object 
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB

The 'tbd' placeholder was preventing 'user_score' from having the numeric dtype needed for future calculations. By replacing these values with NaN, we can make the proper conversion to float, which allows the non-null entries to be used in calculations.

2.3 Handling Missing Values¶

In [10]:
# Examine missing values

display(data[data['name'].isna()])
print()
#display(data[data['user_score'].isna() & data['critic_score'].isna() & data['rating'].isna()])
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating
659 NaN GEN 1993 NaN 1.78 0.53 0.00 0.08 NaN NaN NaN
14244 NaN GEN 1993 NaN 0.00 0.00 0.03 0.00 NaN NaN NaN

We have many missing values in the critic_score, user_score, and rating columns, but only 2 entries missing a name or genre.

In [11]:
# Calculate percentage of missing values

ng_missing = (len(data[data['name'].isna()]) / len(data)) * 100
critic_missing = (len(data[data['critic_score'].isna()]) / len(data)) * 100
user_missing = (len(data[data['user_score'].isna()]) / len(data)) * 100
rating_missing = (len(data[data['rating'].isna()]) / len(data)) * 100

print(f'Percentage of entries missing "name" & "genre": {ng_missing}%')
print(f'Percentage of entries missing "critic_score": {critic_missing}%')
print(f'Percentage of entries missing "user_score": {user_missing}%')
print(f'Percentage of entries missing "rating": {rating_missing}%')
Percentage of entries missing "name" & "genre": 0.011965300628178284%
Percentage of entries missing "critic_score": 51.31917439425665%
Percentage of entries missing "user_score": 54.59168411606342%
Percentage of entries missing "rating": 40.478612025127134%
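As a side note, the per-column percentages above can also be computed in one call with `isna().mean()`; a sketch on a toy frame (hypothetical values):

```python
import pandas as pd

# Toy frame; the real calculation runs on `data`
df = pd.DataFrame({
    'critic_score': [76.0, None, 82.0, None],
    'rating': ['E', None, 'E', 'M'],
})

# Missing share per column, as percentages, in one call
missing_pct = df.isna().mean() * 100
```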
In [12]:
# Analyze patterns in missing values

#display(data[data['year_of_release'] < 1990])
#display(data[(data['year_of_release'] < 2000) & (data['platform'] == 'PS')])
display(data[data['year_of_release'] > 2010])
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating
16 Grand Theft Auto V PS3 2013 Action 7.02 9.09 0.98 3.96 97.0 8.2 M
23 Grand Theft Auto V X360 2013 Action 9.66 5.14 0.06 1.41 97.0 8.1 M
29 Call of Duty: Modern Warfare 3 X360 2011 Shooter 9.04 4.24 0.13 1.32 88.0 3.4 M
31 Call of Duty: Black Ops 3 PS4 2015 Shooter 6.03 5.86 0.36 2.38 NaN NaN NaN
33 Pokemon X/Pokemon Y 3DS 2013 Role-Playing 5.28 4.19 4.35 0.78 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
16703 Strawberry Nauts PSV 2016 Adventure 0.00 0.00 0.01 0.00 NaN NaN NaN
16707 Aiyoku no Eustia PSV 2014 Misc 0.00 0.00 0.01 0.00 NaN NaN NaN
16710 Samurai Warriors: Sanada Maru PS3 2016 Action 0.00 0.00 0.01 0.00 NaN NaN NaN
16712 Haitaka no Psychedelica PSV 2016 Adventure 0.00 0.00 0.01 0.00 NaN NaN NaN
16714 Winning Post 8 2016 PSV 2016 Simulation 0.00 0.00 0.01 0.00 NaN NaN NaN

4022 rows × 11 columns

After running multiple queries, I cannot find a clear pattern in the entries with missing values. My initial guess was that the rating system was developed in a certain time period, perhaps pre-2000, but there are entries with a non-null user or critic score before 2000, even as far back as the 1980s.

Another theory I had was that games only get rated once per title, regardless of whether that title was released on multiple platforms (e.g., GTA V on X360 and PS3), but there are entries that prove this theory false as well.
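The one-rating-per-title theory can be tested directly: group by title and count distinct critic scores; any title with more than one distinct score disproves it. A sketch on hypothetical rows (the real check would run on the full `data` frame):

```python
import pandas as pd

# Toy cross-platform releases with made-up scores
df = pd.DataFrame({
    'name': ['GTA V', 'GTA V', 'FIFA 15', 'FIFA 15'],
    'platform': ['PS3', 'X360', 'PS3', 'X360'],
    'critic_score': [97.0, 97.0, 82.0, 79.0],
})

# Titles whose critic score differs across platforms disprove the theory
scores_per_title = df.groupby('name')['critic_score'].nunique()
inconsistent = scores_per_title[scores_per_title > 1]
```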

In [13]:
# Handle missing values based on analysis
# Your code here to handle missing values according to your strategy

data.dropna(subset=['name'], inplace=True)

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16713 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16713 non-null  object 
 2   year_of_release  16713 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16713 non-null  float64
 5   eu_sales         16713 non-null  float64
 6   jp_sales         16713 non-null  float64
 7   other_sales      16713 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       7590 non-null   float64
 10  rating           9949 non-null   object 
dtypes: float64(6), int64(1), object(4)
memory usage: 1.5+ MB

I've decided to remove the 2 entries missing 'name' and 'genre', since their sales are so small they won't affect our analysis; these 2 entries account for 0.01% of the entire dataset.

Earlier in this analysis I substituted the median (2007) for the missing values in 'year_of_release'. I feel confident in this choice, considering the mean is 2006.48.

In [14]:
# Why do you think the values are missing? Give possible reasons.
# Explain why you filled in the missing values as you did or why you decided to leave them blank.


Given the quantity of games here and the subjectivity of scores, it's possible that a score either hadn't been settled upon when this data was gathered (which would explain 1985's Super Mario Bros.) or that some of these titles didn't sell enough to be considered for a score. The reality is likely a combination of both. With the sheer amount of data missing, I believe the best course of action is to leave these blank and create new dataframes containing only the valid entries when it comes time to analyze the scores and ratings.
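A minimal sketch of the filter-on-demand approach described above, using hypothetical rows (the real subset would be built from `data` when score analysis begins):

```python
import pandas as pd

# Toy frame with the same missing-score pattern as the real data
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'critic_score': [76.0, None, 82.0],
    'user_score': [8.0, None, None],
})

# Keep rows with both scores present only for the score analysis itself,
# leaving the full frame untouched for sales work
scored = df.dropna(subset=['critic_score', 'user_score'])
```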

Reviewer's comment v1

Well done!

2.4 Calculate Total Sales¶

In [15]:
# Calculate total sales across all regions and put them in a different column

data['total_sales'] = data['na_sales'] + data['eu_sales'] + data['jp_sales'] + data['other_sales']

data.info()
display(data.sample(30, random_state=999))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16713 entries, 0 to 16714
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             16713 non-null  object 
 1   platform         16713 non-null  object 
 2   year_of_release  16713 non-null  int64  
 3   genre            16713 non-null  object 
 4   na_sales         16713 non-null  float64
 5   eu_sales         16713 non-null  float64
 6   jp_sales         16713 non-null  float64
 7   other_sales      16713 non-null  float64
 8   critic_score     8137 non-null   float64
 9   user_score       7590 non-null   float64
 10  rating           9949 non-null   object 
 11  total_sales      16713 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 1.7+ MB
name platform year_of_release genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
3256 Prince of Persia: The Forgotten Sands X360 2010 Action 0.30 0.26 0.00 0.06 74.0 7.8 T 0.62
10572 Football Director DS DS 2008 Sports 0.00 0.10 0.00 0.00 NaN NaN NaN 0.10
9920 Spider-Man: Edge of Time 3DS 2011 Action 0.08 0.03 0.00 0.01 50.0 6.4 T 0.12
10204 Snowboard Kids 2 N64 1999 Sports 0.09 0.02 0.00 0.00 NaN NaN NaN 0.11
13602 Record of Agarest War: Marriage PSP 2012 Role-Playing 0.00 0.00 0.04 0.00 NaN NaN NaN 0.04
3156 Real Heroes: Firefighter Wii 2009 Action 0.56 0.04 0.00 0.05 67.0 8.2 E10+ 0.65
15800 Canada Hunt Wii 2010 Sports 0.02 0.00 0.00 0.00 NaN NaN T 0.02
4762 Tak and the Power of Juju GC 2003 Platform 0.31 0.08 0.00 0.01 71.0 8.2 E 0.40
8898 High Velocity Bowling PS3 2010 Sports 0.13 0.00 0.00 0.01 66.0 8.1 E10+ 0.14
13190 Harukanaru Toki no Naka de 3 PS2 2004 Adventure 0.00 0.00 0.05 0.00 NaN NaN NaN 0.05
5243 Formula 1 Championship Edition PS 1997 Racing 0.06 0.04 0.23 0.02 NaN NaN NaN 0.35
16326 Victory Spike PS 1996 Sports 0.00 0.00 0.01 0.00 NaN NaN NaN 0.01
5991 Imagine: Fashion Designer World Tour DS 2009 Simulation 0.27 0.00 0.00 0.02 NaN NaN E 0.29
8472 TransFormers Animated: The Game DS 2008 Action 0.15 0.00 0.00 0.01 69.0 NaN E 0.16
11997 The Legend of Heroes VII: The Trail of Zero PSV 2012 Role-Playing 0.00 0.00 0.07 0.00 NaN NaN NaN 0.07
9040 Thor: God of Thunder X360 2011 Action 0.09 0.04 0.00 0.01 38.0 5.7 T 0.14
3672 Top Gun: Fire at Will! PS 1996 Misc 0.30 0.21 0.00 0.04 NaN NaN NaN 0.55
13876 Doraemon: Shin Nobita no Nihon Tanjou 3DS 2016 Misc 0.00 0.00 0.04 0.00 NaN NaN NaN 0.04
5080 Imagine: Party Babyz Wii 2008 Simulation 0.33 0.02 0.00 0.03 NaN 6.1 E 0.38
11266 Monster Jam: Urban Assault PSP 2008 Racing 0.08 0.00 0.00 0.01 NaN NaN E 0.09
10364 Pro Yakyuu Team o Tsukurou! 2 DS 2009 Sports 0.00 0.00 0.11 0.00 NaN NaN NaN 0.11
3825 Tiger Woods PGA Tour 11 X360 2010 Sports 0.28 0.19 0.00 0.05 79.0 7.8 E 0.52
9462 Mega Man Anniversary Collection XB 2005 Platform 0.10 0.03 0.00 0.00 80.0 9.2 E 0.13
9336 Rogue Warrior PS3 2009 Shooter 0.11 0.01 0.00 0.01 27.0 2.8 M 0.13
2802 Practise English! DS 2007 Misc 0.00 0.00 0.73 0.00 NaN NaN NaN 0.73
3984 Star Wars: Clone Wars PS2 2002 Shooter 0.24 0.19 0.00 0.06 NaN NaN NaN 0.49
2936 Minecraft: Story Mode PS4 2015 Adventure 0.25 0.33 0.00 0.11 NaN NaN NaN 0.69
722 Spyro: Season of Ice GBA 2001 Platform 1.29 0.83 0.00 0.11 74.0 8.0 E 2.23
12836 Nobunaga's Ambition: Sphere of Influence - Sen... PS4 2016 Misc 0.01 0.00 0.05 0.00 NaN NaN NaN 0.06
7160 NCAA Football 2005 GC 2004 Sports 0.17 0.04 0.00 0.01 88.0 9.0 E 0.22

Step 3: Analyzing Video Game Sales Data¶

3.1 Temporal Analysis of Game Releases¶

Let's first examine the distribution of game releases across different years to understand our data's coverage and significance:

In [16]:
# Create a DataFrame with game releases by year

year_data = data[['name', 'platform', 'year_of_release']]
year_data = year_data.sort_values(by='year_of_release')

display(year_data)
year_data.info()
name platform year_of_release
1968 Defender 2600 1980
6300 Bridge 2600 1980
1764 Kaboom! 2600 1980
546 Missile Command 2600 1980
5360 Freeway 2600 1980
... ... ... ...
14404 Prison Architect PS4 2016
11168 Terraria 3DS 2016
14414 Atelier Shallie Plus: Alchemists of the Dusk Sea PSV 2016
11243 Assassin's Creed Chronicles PS4 2016
16714 Winning Post 8 2016 PSV 2016

16713 rows × 3 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16713 entries, 1968 to 16714
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             16713 non-null  object
 1   platform         16713 non-null  object
 2   year_of_release  16713 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 522.3+ KB
In [17]:
# Visualize the distribution of games across years

fig = px.histogram(year_data, x='year_of_release', width=1000, height=600)

fig.update_layout(title_text='Quantity of Games Released by Year',
                  xaxis_title_text='Year', yaxis_title_text='Quantity')
fig.show()
In [18]:
# Display summary statistics for each year

#Summary Statistics Calculation for the Entire Sample:

def mean_var_stdv(series):
    mean = series.mean()
    var = series.var()
    stdv = mt.sqrt(var)
    return f'Mean: {mean}\nVariance: {var}\nSt.Deviation: {stdv}'

print(f'Temporal Summary Statistics for all Observable Years:\n\n{mean_var_stdv(year_data["year_of_release"])}\n')
print(f'-~-~-~-' * 21)
print(f'\nDataFrame Containing Amount of Games Released and Corresponding Percentage, Grouped by Platform and Year:\n')

#Creating new DataFrame containing the amount of titles released each year by platform, as well as the corresponding share percentage (ex. 10 releases in one year / 100 releases across all years = 10% of sales in that year)

total_by_year = year_data.groupby(['platform', 'year_of_release'])['name'].count().reset_index()
total_by_year = total_by_year.rename(columns={'year_of_release' : 'year', 'name' : 'titles_released_count'})

lifetime_total_series = total_by_year.groupby('platform')['titles_released_count'].transform('sum')
total_by_year['platform_lifetime_percentage'] = ((total_by_year['titles_released_count'] / lifetime_total_series) * 100).map('{:.2f}%'.format)

display(total_by_year)

#display(lifetime_total_series)
Temporal Summary Statistics for all Observable Years:

Mean: 2006.4945252198886
Variance: 33.970305711920766
St.Deviation: 5.828405074453968

-~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~--~-~-~-

DataFrame Containing Amount of Games Released and Corresponding Percentage, Grouped by Platform and Year:

platform year titles_released_count platform_lifetime_percentage
0 2600 1980 9 6.77%
1 2600 1981 46 34.59%
2 2600 1982 36 27.07%
3 2600 1983 11 8.27%
4 2600 1984 1 0.75%
... ... ... ... ...
239 XB 2008 1 0.12%
240 XOne 2013 19 7.69%
241 XOne 2014 61 24.70%
242 XOne 2015 80 32.39%
243 XOne 2016 87 35.22%

244 rows × 4 columns

1. The temporal summary statistics match what we can see in the previously shown histogram, with the peak lying somewhere between 2006 and 2010. There is a higher volume of data to the left of the median, pulling the mean slightly lower than a balanced distribution would give. This is also supported by the fact that -3 sigma (2006.49 - 3*5.83, about 1989) still falls within the observed years, while only +1 sigma (2006.49 + 5.83, about 2012) fits on the right.

2. Summary statistics by platform and year: this DataFrame contains every observable year in our dataset, with the number of games released on each platform as entries. Each percentage is calculated as the titles released on a platform in a given year, divided by the total titles released over that platform's lifespan.
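The left-skew argument in point 1 can be checked numerically: in a left-skewed sample the mean sits below the median, and more than half the observations lie above the mean. A toy illustration (hypothetical years, not the dataset's):

```python
import pandas as pd

# Hypothetical left-skewed release years: a tail of early years pulls the mean down
years = pd.Series([1985, 1995, 2005, 2006, 2007, 2008, 2009, 2010])

mean, median = years.mean(), years.median()
share_above_mean = (years > mean).mean()  # fraction of observations above the mean
```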

Questions to Consider:¶

  • Which years show significant numbers of game releases?
  • Are there any notable trends or patterns in the number of releases?
  • Is there enough recent data to make predictions for 2017?

Our first sharp increase in games released came in the year 2000, with a 236.86% increase over the next 2 years (2000 - 2002). Something to consider: "The Matrix" trilogy was unfolding at the time (1999 - 2003), drawing a wider audience into 'cyberspace' and possibly contributing to this surge. The bulk of our data lies between 2007 and 2011, a period of consistent growth in the gaming market. With the introduction of YouTube in 2005, gaming saw a massive market shift, partly explaining the large increase mentioned previously.

I believe that with our dataset it is possible to make predictions based on observable trends; however, there is no certainty about the future. Weather forecasts are a good example of this.
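For reference, growth figures like the 236.86% above come from comparing per-year release counts; a sketch with hypothetical counts (not the actual dataset figures):

```python
import pandas as pd

# Hypothetical titles-per-year counts standing in for the real tallies
counts = pd.Series({2000: 100, 2001: 180, 2002: 337})

# Percentage growth from 2000 to 2002
growth = (counts[2002] / counts[2000] - 1) * 100
```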

Reviewer's comment v1

Great job!

3.2 Platform Sales Analysis Over Time¶

Now let's analyze how sales vary across platforms and years:

In [19]:
# Calculate total sales by platform and year

plt_yr_total = data.groupby(['platform', 'year_of_release'])['total_sales'].sum()
plt_yr_total = plt_yr_total.unstack(level='year_of_release')
plt_yr_total = plt_yr_total.fillna(0)

display(plt_yr_total)
plt_yr_total.info()

#display(data['platform'].value_counts())

#year_plt_sales = data[['name', 'platform', 'total_sales', 'year_of_release']]
#year_plt_sales = year_plt_sales.sort_values(by='year_of_release')
#year_plt_sales.info()
#display(year_plt_sales)
year_of_release 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
platform
2600 11.38 35.68 28.88 5.84 0.27 0.45 0.67 1.94 0.74 0.63 ... 10.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3DO 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3DS 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 1.19 0.00 0.00 0.00 63.20 51.36 56.57 43.76 27.78 15.14
DC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.02 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
DS 0.00 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 ... 150.28 145.31 119.54 85.02 26.18 11.01 1.54 0.00 0.00 0.00
GB 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.43 64.97 ... 1.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GBA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 8.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 2.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GEN 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GG 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
N64 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
NES 0.00 0.00 0.00 10.96 50.08 53.44 36.41 19.76 45.01 7.85 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
NG 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
PC 0.00 0.00 0.00 0.00 0.00 0.04 0.00 0.00 0.03 0.00 ... 13.04 12.42 16.91 24.28 35.03 23.22 12.38 13.28 8.52 5.25
PCFX 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
PS 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 3.28 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
PS2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 98.20 53.90 26.40 5.64 0.45 0.00 0.00 0.00 0.00 0.00
PS3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 81.50 118.52 130.93 142.17 156.78 107.36 113.25 47.76 16.82 3.60
PS4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 25.99 100.00 118.90 69.25
PSP 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 51.45 34.55 37.78 35.04 17.82 7.69 3.14 0.24 0.12 0.00
PSV 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.26 0.00 0.00 0.00 4.63 16.19 10.59 11.90 6.25 4.25
SAT 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SCD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SNES 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
TG16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
WS 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Wii 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 169.10 171.32 206.97 127.95 59.65 21.71 8.59 3.75 1.14 0.18
WiiU 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 17.56 21.65 22.03 16.35 4.60
X360 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 105.59 135.26 120.29 170.03 143.84 99.74 88.58 34.74 11.96 1.52
XB 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 6.72 0.18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
XOne 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 18.96 54.07 60.14 26.15

31 rows × 37 columns

<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 2600 to XOne
Data columns (total 37 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1980    31 non-null     float64
 1   1981    31 non-null     float64
 2   1982    31 non-null     float64
 3   1983    31 non-null     float64
 4   1984    31 non-null     float64
 5   1985    31 non-null     float64
 6   1986    31 non-null     float64
 7   1987    31 non-null     float64
 8   1988    31 non-null     float64
 9   1989    31 non-null     float64
 10  1990    31 non-null     float64
 11  1991    31 non-null     float64
 12  1992    31 non-null     float64
 13  1993    31 non-null     float64
 14  1994    31 non-null     float64
 15  1995    31 non-null     float64
 16  1996    31 non-null     float64
 17  1997    31 non-null     float64
 18  1998    31 non-null     float64
 19  1999    31 non-null     float64
 20  2000    31 non-null     float64
 21  2001    31 non-null     float64
 22  2002    31 non-null     float64
 23  2003    31 non-null     float64
 24  2004    31 non-null     float64
 25  2005    31 non-null     float64
 26  2006    31 non-null     float64
 27  2007    31 non-null     float64
 28  2008    31 non-null     float64
 29  2009    31 non-null     float64
 30  2010    31 non-null     float64
 31  2011    31 non-null     float64
 32  2012    31 non-null     float64
 33  2013    31 non-null     float64
 34  2014    31 non-null     float64
 35  2015    31 non-null     float64
 36  2016    31 non-null     float64
dtypes: float64(37)
memory usage: 9.2+ KB
In [20]:
# Create a heatmap of platform sales over time

fig = px.imshow(plt_yr_total, labels=dict(x="Year", y="Platform", color="Total Sales"),
                x=plt_yr_total.columns, y=plt_yr_total.index, width=1000, height=800, 
                title='Total Game Sales (USD Million) by Platform and Year')
fig.show()
In [21]:
# Identify platforms with declining sales

Any platform that has been usurped by newer technology will begin to decline, as we can see with the platforms PS, PS2, PS3, and PS4. When sales peak and begin to fall, the market is seemingly migrating to the next big thing. There are additional unknown factors that contribute to sales; for example, the announcement of a newer console may lead people to halt consumption and wait for it. Another factor worth considering is console-exclusive titles, which may move certain consumers from one platform to another if they do not wish to own multiple consoles.
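One way to flag the decline described above programmatically is to compare each platform's most recent yearly sales against its peak year; a sketch with hypothetical totals standing in for plt_yr_total:

```python
import pandas as pd

# Hypothetical platform-by-year sales totals (columns are years)
sales = pd.DataFrame(
    {2014: [47.8, 100.0], 2015: [16.8, 118.9], 2016: [3.6, 69.3]},
    index=['PS3', 'PS4'],
)

# Flag a platform as declining if its latest year is well below its peak year
latest = sales[2016]
peak = sales.max(axis=1)
declining = latest[latest < 0.5 * peak].index.tolist()
```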

Questions to Consider:¶

  • Which platforms show consistent sales over time?
  • Can you identify platforms that have disappeared from the market?
  • What's the typical lifecycle of a gaming platform?

The leading platform in sales consistency has to be 'PC', at 31 years (1985 - 2016), which makes sense since there are no definitive hardware constraints by which to categorize PC from generation to generation.

'GB' comes in second, with consistent sales across a surprising 13-year span (1988 - 2001). Finally, 'NES' comes in third for consistency in sales over time, with 11 years (1983 - 1994).

When looking for consistency, we want the most stable record over a given time period, not including platforms that saw meteoric rises before succumbing to their successors (as in the case of PlayStation's four generations, and multiple handhelds).
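The lifespans quoted above can be computed rather than eyeballed: for each platform, take the span between its first and last release year. A sketch on hypothetical rows (the real version would use `data`):

```python
import pandas as pd

# Toy (platform, year) release rows
df = pd.DataFrame({
    'platform': ['PC', 'PC', 'GB', 'GB', 'GB'],
    'year_of_release': [1985, 2016, 1988, 1995, 2001],
})

# Active lifespan in years: last release year minus first, per platform
lifespan = df.groupby('platform')['year_of_release'].agg(lambda y: y.max() - y.min())
```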

Reviewer's comment v1

Great job!

3.3 Determining Relevant Time Period¶

Based on your analysis above, determine the appropriate time period for predicting 2017 sales:

In [22]:
# Your code here to filter the dataset to relevant years
# Example:
# relevant_years = [XXXX, XXXX, XXXX] # Replace with your chosen years
# df_relevant = df[df['year_of_release'].isin(relevant_years)]

# Justify your choice with data

select_years = [2011, 2012, 2013, 2014, 2015, 2016]
data_select_years = data[['platform', 'year_of_release', 'total_sales']]
data_select_years = data_select_years[data_select_years['year_of_release'].isin(select_years)].sort_values(by='year_of_release')
data_select_years = data_select_years.groupby(['platform', 'year_of_release'])['total_sales'].sum().reset_index()

#data_select_years = data_select_years.unstack(level='year_of_release').fillna(0)

display(data_select_years.head(20))

#data_select_years.info()
platform year_of_release total_sales
0 3DS 2011 63.20
1 3DS 2012 51.36
2 3DS 2013 56.57
3 3DS 2014 43.76
4 3DS 2015 27.78
5 3DS 2016 15.14
6 DS 2011 26.18
7 DS 2012 11.01
8 DS 2013 1.54
9 PC 2011 35.03
10 PC 2012 23.22
11 PC 2013 12.38
12 PC 2014 13.28
13 PC 2015 8.52
14 PC 2016 5.25
15 PS2 2011 0.45
16 PS3 2011 156.78
17 PS3 2012 107.36
18 PS3 2013 113.25
19 PS3 2014 47.76

Document Your Decision:¶

  • What years did you select and why?
  • How does this period reflect current market conditions?
  • What factors influenced your decision?

I've decided to go with 2011 - 2016, since it should be easier to make a prediction from a stable trajectory, much like analyzing the influence of gravity on a projectile.

As for current market conditions, we can't say for certain why title releases have declined over time. We do, however, have to acknowledge the trending decline, since titles have to be released in order for sales to occur, and that decline is the primary reason for my time frame selection.
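The declining-release trend behind this decision can be checked directly by counting titles per year. A sketch of that check on a toy frame (illustrative rows only; the project loads the real data from /datasets/games.csv):

```python
import pandas as pd

# Illustrative rows standing in for the real games frame
games = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'year_of_release': [2011, 2011, 2012, 2012, 2012, 2014, 2015, 2016],
})

# Count titles released per year within the selected window
in_window = games[games['year_of_release'].between(2011, 2016)]
titles_per_year = in_window.groupby('year_of_release')['name'].count()
```

A downward slope in `titles_per_year` on the real data would support the chosen window.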

Reviewer's comment v1

Great time frame decision!

3.4 Platform Performance Analysis¶

Using your selected time period, let's analyze platform performance:

In [23]:
# Analyze platform sales trends

display(data_select_years.head(10))
platform year_of_release total_sales
0 3DS 2011 63.20
1 3DS 2012 51.36
2 3DS 2013 56.57
3 3DS 2014 43.76
4 3DS 2015 27.78
5 3DS 2016 15.14
6 DS 2011 26.18
7 DS 2012 11.01
8 DS 2013 1.54
9 PC 2011 35.03

With the first 10 rows visible, we can see that the 'DS' was at the tail end of its lifecycle (2011 - 2013), passing the torch to the next generation, the '3DS'. Both platforms saw sales concurrently (2011 - 2013) until titles were no longer being developed for the 'parent' console ('DS'), giving way to the 'child' console ('3DS'). This trend occurs across multiple platforms in this analysis, the only exception being 'PC', where no generational gaps can be distinguished.
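The parent/child handover window can be made concrete by intersecting the two platforms' active years. A small sketch, assuming the project's column names; the helper `overlap_years` and the toy rows are my own illustration:

```python
import pandas as pd

# Toy rows mirroring the DS -> 3DS handover described above
games = pd.DataFrame({
    'platform': ['DS', 'DS', 'DS', '3DS', '3DS', '3DS'],
    'year_of_release': [2011, 2012, 2013, 2011, 2012, 2016],
})

def overlap_years(df, parent, child):
    """Years in which both platforms had releases."""
    parent_years = set(df.loc[df['platform'] == parent, 'year_of_release'])
    child_years = set(df.loc[df['platform'] == child, 'year_of_release'])
    return sorted(parent_years & child_years)

handover = overlap_years(games, 'DS', '3DS')
```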

In [24]:
# Sort platforms by total sales

#data_select_years['total_sales'] = data_select_years.sum(axis=1)
data_select_years = data_select_years.sort_values(by='total_sales')

display(data_select_years.head(20))
data_select_years.info()
platform year_of_release total_sales
30 PSP 2015 0.12
42 Wii 2016 0.18
29 PSP 2014 0.24
15 PS2 2011 0.45
41 Wii 2015 1.14
53 X360 2016 1.52
8 DS 2013 1.54
28 PSP 2013 3.14
21 PS3 2016 3.60
40 Wii 2014 3.75
36 PSV 2016 4.25
47 WiiU 2016 4.60
31 PSV 2011 4.63
14 PC 2016 5.25
35 PSV 2015 6.25
27 PSP 2012 7.69
13 PC 2015 8.52
39 Wii 2013 8.59
33 PSV 2013 10.59
7 DS 2012 11.01
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 30 to 16
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   platform         58 non-null     object 
 1   year_of_release  58 non-null     int64  
 2   total_sales      58 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.8+ KB
In [25]:
# Visualize top platforms
# Calculate year-over-year growth for each platform
# Your code here to calculate and visualize platform growth rates

#fig = px.histogram(data_select_years, x='platform', y='total_sales', 
                   #color='year_of_release', barmode='group', height=500, 
                   #labels={'total_sales' : 'Total Sales'})

#Visualizing platform performance by year, and changes in sales over time:

data_select_years = data_select_years.sort_values(by='year_of_release')

fig = px.line(data_select_years, x='year_of_release', y='total_sales',
                 color='platform', symbol='platform', height=800,
                 labels={'total_sales' : 'Total Sales (USD Million)', 'year_of_release' : 'Year'},
              title='Which Platform Sold the Most Titles from 2011 to 2016?',
             color_discrete_sequence=px.colors.qualitative.Dark24)

fig.show()

I've chosen a line graph in order to showcase both each console's yearly standing and its changes over time. With the number of platforms on display here, it was vital to ensure visual clarity by utilizing a qualitative color sequence and per-platform symbols.

As stated earlier, there are multiple generations of platforms on display here. New hardware launches, both generations share the market for a few years, and then the predecessor is inevitably rendered obsolete when titles cease to launch for it. Some platforms have milder and more consistent market trends (Wii & WiiU) than the top-selling platforms (PS3 & PS4), but loosely follow the same pattern of generational succession.

Reviewer's comment v1

Well done!

3.5 Sales Distribution Analysis¶

Let's examine the distribution of sales across platforms:

In [26]:
# Create box plot of sales by platform

fig = px.box(data_select_years, x='platform', y='total_sales',
            color='platform', height=700, labels={'total_sales' : 'Total Sales (USD Million)', 'platform' : 'Platform'},
            points='all', title='2011 to 2016: Annual Sales Distribution Analysis')
fig.show()

#display(data_select_years)

This box plot shows us a similar pattern, but in a different format. For the older platform generations trending downward, the closer the minimum value is to 0, the closer the platform is to the end of its market life cycle. Sales still refer to games sold on each platform.

In [27]:
# Calculate detailed statistics for each platform

boxplot_summary_stats = data_select_years.groupby('platform')['total_sales'].describe()
display(boxplot_summary_stats)
count mean std min 25% 50% 75% max
platform
3DS 6.0 42.968333 18.278261 15.14 31.7750 47.560 55.2675 63.20
DS 3.0 12.910000 12.429397 1.54 6.2750 11.010 18.5950 26.18
PC 6.0 16.280000 11.007403 5.25 9.4850 12.830 20.7350 35.03
PS2 1.0 0.450000 NaN 0.45 0.4500 0.450 0.4500 0.45
PS3 6.0 74.261667 60.695667 3.60 24.5550 77.560 111.7775 156.78
PS4 4.0 78.535000 40.567917 25.99 58.4350 84.625 104.7250 118.90
PSP 5.0 5.802000 7.387558 0.12 0.2400 3.140 7.6900 17.82
PSV 6.0 8.968333 4.729695 4.25 5.0350 8.420 11.5725 16.19
Wii 6.0 15.836667 22.864073 0.18 1.7925 6.170 18.4300 59.65
WiiU 5.0 16.438000 7.068244 4.60 16.3500 17.560 21.6500 22.03
X360 6.0 63.396667 56.074638 1.52 17.6550 61.660 96.9500 143.84
XOne 4.0 39.830000 20.313977 18.96 24.3525 40.110 55.5875 60.14

A more refined look at our data, but still the same message is conveyed: a minimum value's proximity to the mean tells us what kind of trajectory a platform is on. 'PC' emerged back in the 1980s and kept a consistent trend up through 2016. On the other hand we have the 'PS3', the best-performing console within this sample, but one rapidly approaching zero as its successor, the 'PS4', enters the market.
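The "min close to the mean" signal can be expressed as a single indicator: the ratio of a platform's minimum annual sales to its mean annual sales, where values near zero flag a platform at the end of its cycle. A sketch with illustrative numbers (the frame below is a toy, not the project's data):

```python
import pandas as pd

# Illustrative annual totals per platform
annual = pd.DataFrame({
    'platform': ['PS3', 'PS3', 'PC', 'PC'],
    'total_sales': [156.78, 3.60, 35.03, 5.25],
})

stats = annual.groupby('platform')['total_sales'].agg(['min', 'mean'])
stats['decline_ratio'] = stats['min'] / stats['mean']  # near 0 => declining
```

In this toy slice, 'PS3' gets a much smaller ratio than 'PC', matching the trajectory described above.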

3.6 Review Score Impact Analysis¶

Select a popular platform and analyze how reviews affect sales:

In [28]:
# Choose a popular platform based on your previous analysis

popular_platform = data[(data['platform'] == 'PS3') & (data['year_of_release'] >= 2011)][['name', 'year_of_release', 'critic_score', 'user_score', 'total_sales']].dropna()

display(popular_platform)
name year_of_release critic_score user_score total_sales
16 Grand Theft Auto V 2013 97.0 8.2 21.05
34 Call of Duty: Black Ops II 2012 83.0 5.3 13.79
37 Call of Duty: Modern Warfare 3 2011 88.0 3.2 13.33
69 Call of Duty: Ghosts 2013 71.0 2.6 9.36
81 FIFA Soccer 13 2012 88.0 6.6 8.17
... ... ... ... ... ...
15533 Battle Princess of Arcadias 2013 69.0 7.9 0.02
15567 Tom Clancy's Splinter Cell Classic Trilogy HD 2011 67.0 6.8 0.02
15742 Invizimals: The Lost Kingdom 2013 50.0 5.0 0.02
16146 Don Bradman Cricket 14 2014 73.0 6.9 0.01
16230 Madden NFL 13 2012 83.0 5.5 0.01

342 rows × 5 columns

The console that saw the most sales in our selected period was undoubtedly the 'PS3', with the largest mean/standard-deviation pairing of all platforms. This lets us peer into the breadth of this platform's market reach.

Since we're still under "3.x", I will continue using the selected years for further analysis until "4.0", unless explicitly stated otherwise.

As I mentioned at the beginning of this analysis, I will omit all entries with missing values when analyzing 'critic_score' and 'user_score', as those entries have no value from this angle of observation.

In [29]:
# Create scatter plots for both critic and user scores

popular_platform_twinbox = popular_platform.melt(id_vars=['name', 'year_of_release', 'total_sales'],
                                                 value_vars=['critic_score', 'user_score'], var_name='score_type',
                                                 value_name='score')

popular_platform_twinbox['score'] = popular_platform_twinbox.apply(lambda row: row['score'] * 10 if row['score_type'] == 'user_score' else
                                                          row['score'], axis=1)

#display(popular_platform_twinbox)

fig = px.strip(popular_platform_twinbox, x='year_of_release', y='score',
                 color='score_type', facet_col='score_type', height=900, hover_data=['name'],
               labels={'year_of_release' : 'Year', 'score' : 'Score (out of 100)'},
              title='How Similar are the Review Scores of Critics and Users?')
fig.update_traces(jitter=1)
fig.show()

Scores between the two sources seem mildly similar at a glance, with some entries within 5 points of each other (2013's 'The Last of Us' at 95 vs 91) and some over 10 points apart (2013's 'Grand Theft Auto V' at 97 vs 82). There may be any number of reasons for the differing scores, but in my experience within this domain, critics tend to have different motives for playing than users do (duty vs leisure).

It's also worth considering that critics may receive a title for free before review, since their scores can influence the market. With this in mind, reviewing a game that cost nothing may positively skew the reviewer's experience (critic_score), while paying for the title sets an expectation of the product (user_score). This dynamic will not hold in every case, and it is only one of many possible factors, but in the case of 'MindJack' (2011) it potentially explains the massive gap between a critic score of 44 and a user score of 17.

In [30]:
# Critic Scores

critic_scores = popular_platform['critic_score']

# User Scores

user_scores = popular_platform['user_score']

# Calculate correlations

score_correlation = critic_scores.corr(user_scores)

print(score_correlation)
0.5633145093196069

With a correlation coefficient of 0.56 between the two score types, we can say there is a moderate positive correlation: critics and users have similar, but not identical, taste.
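Since this step asks how reviews affect sales, the same `Series.corr` call extends naturally to score-versus-sales relationships. A sketch on made-up numbers (the real analysis would use the `popular_platform` frame; these values are illustrative only):

```python
import pandas as pd

# Illustrative scores and sales, not rows from the dataset
df = pd.DataFrame({
    'critic_score': [97, 83, 71, 50, 88],
    'user_score': [8.2, 5.3, 2.6, 5.0, 6.6],
    'total_sales': [21.05, 13.79, 9.36, 0.02, 8.17],
})

# Pearson correlation of each score type with sales
critic_sales_corr = df['critic_score'].corr(df['total_sales'])
user_sales_corr = df['user_score'].corr(df['total_sales'])
```

Comparing the two coefficients indicates which score type tracks sales more closely on a given platform.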

3.7 Cross-Platform Comparison¶

Compare sales performance of games across different platforms:

In [31]:
# Find games released on multiple platforms

cross_platform = data[(data['year_of_release'] >= 2011) & (data['total_sales'] >= 1.0)][['name', 'year_of_release', 'platform', 'total_sales']]
cross_platform = cross_platform[cross_platform['name'].duplicated(keep=False)]

display(cross_platform)

cross_platform.info()
name year_of_release platform total_sales
16 Grand Theft Auto V 2013 PS3 21.05
23 Grand Theft Auto V 2013 X360 16.27
29 Call of Duty: Modern Warfare 3 2011 X360 14.73
31 Call of Duty: Black Ops 3 2015 PS4 14.63
34 Call of Duty: Black Ops II 2012 PS3 13.79
... ... ... ... ...
2034 Skylanders SWAP Force 2013 PS3 1.02
2036 Sonic Generations 2011 3DS 1.01
2048 Need for Speed: Most Wanted 2012 PSV 1.01
2050 Assassin's Creed: Rogue 2014 X360 1.01
2091 Dark Souls 2011 X360 1.00

254 rows × 4 columns

<class 'pandas.core.frame.DataFrame'>
Int64Index: 254 entries, 16 to 2091
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             254 non-null    object 
 1   year_of_release  254 non-null    int64  
 2   platform         254 non-null    object 
 3   total_sales      254 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 9.9+ KB
In [32]:
# Compare sales across platforms for these games
# Your code here to analyze and visualize cross-platform performance

fig = px.scatter(cross_platform, x='name', y='total_sales',
                 color='name', symbol='platform', height=800, size='total_sales',
                 labels={'total_sales' : 'Total Sales (USD Million)', 'name' : 'Game Title', 'platform' : 'Platform'},
                 title='How Well do Cross-Platform Games Perform?')

fig.show()

This is a ton of data points, so for the sake of visual clarity I've decided to omit any games that did not cross 1 million USD in sales.

With that in mind, we can see that games released on multiple platforms tend to net more sales overall, with some of the largest titles releasing on consecutive generations of platforms, which we can easily see thanks to plotly's wonderful legend layout.

Earlier platform popularity observations extend to this analysis: there is a hierarchy of which console sells the most games. 'PS3' will almost always lead, with 'X360' following close behind, and the next generation ('PS4' & 'XOne') making its debut, with some titles re-releasing on the newer hardware.
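The per-title comparison can also be laid out as a table, one row per game and one column per platform, via a pivot. A sketch on toy rows (column names match the project; the values are invented):

```python
import pandas as pd

# Illustrative multi-platform releases
cross = pd.DataFrame({
    'name': ['GTA V', 'GTA V', 'Dark Souls', 'Dark Souls'],
    'platform': ['PS3', 'X360', 'PS3', 'X360'],
    'total_sales': [21.05, 16.27, 1.20, 1.00],
})

# One row per title, one column per platform (NaN where not released)
by_platform = cross.pivot_table(index='name', columns='platform',
                                values='total_sales', aggfunc='sum')
```

Row-wise differences in `by_platform` show directly which platform sold more copies of each shared title.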

3.8 Genre Analysis¶

Finally, let's examine the distribution of games by genre:

In [33]:
# Analyze genre performance

genre_analysis = data[data['year_of_release'] >= 2011][['genre', 'total_sales']]
genre_analysis = genre_analysis.groupby('genre')['total_sales'].sum().reset_index()

display(genre_analysis)

genre_analysis.info()
genre total_sales
0 Action 557.80
1 Adventure 45.04
2 Fighting 66.90
3 Misc 139.42
4 Platform 88.68
5 Puzzle 9.85
6 Racing 88.31
7 Role-Playing 245.63
8 Shooter 402.91
9 Simulation 50.26
10 Sports 237.15
11 Strategy 22.16
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   genre        12 non-null     object 
 1   total_sales  12 non-null     float64
dtypes: float64(1), object(1)
memory usage: 320.0+ bytes
In [34]:
# Sort genres by total sales

genre_analysis = genre_analysis.sort_values(by='total_sales', ascending=False)

display(genre_analysis)
genre total_sales
0 Action 557.80
8 Shooter 402.91
7 Role-Playing 245.63
10 Sports 237.15
3 Misc 139.42
4 Platform 88.68
6 Racing 88.31
2 Fighting 66.90
9 Simulation 50.26
1 Adventure 45.04
11 Strategy 22.16
5 Puzzle 9.85
In [35]:
# Visualize genre distribution

fig = px.bar(genre_analysis, x='genre', y='total_sales', height=600,
             color='genre', labels={'total_sales' : 'Total Sales (USD Million)', 'genre' : 'Genre'},
             title='Which Genre Sells the Most?')

fig.update_layout(showlegend=False)
fig.show()
In [36]:
# Calculate market share for each genre

sales_sum = genre_analysis['total_sales'].sum()

genre_analysis['market_share'] = ((genre_analysis['total_sales'] / sales_sum) * 100).map('{:.2f}%'.format)

display(genre_analysis)

#market_share_100 = genre_analysis['market_share'].sum()
#print(market_share_100)
genre total_sales market_share
0 Action 557.80 28.54%
8 Shooter 402.91 20.62%
7 Role-Playing 245.63 12.57%
10 Sports 237.15 12.14%
3 Misc 139.42 7.13%
4 Platform 88.68 4.54%
6 Racing 88.31 4.52%
2 Fighting 66.90 3.42%
9 Simulation 50.26 2.57%
1 Adventure 45.04 2.30%
11 Strategy 22.16 1.13%
5 Puzzle 9.85 0.50%

Action games are leading the pack substantially, with Shooter trailing close behind. These 2 genres by themselves make up almost half (~49%) of the total gaming market in recent years (2011 - 2016). Again, this matches our previous observations: 'Grand Theft Auto V' and the 'Call of Duty' series are the best-selling titles in this sample, and having been released on multiple platforms across multiple generations further secures their market positions.

Reviewer's comment v1

Great job!

Key Questions for Genre Analysis:¶

  • Which genres consistently perform well?
  • Are there any genres showing recent growth or decline?
  • How does the average performance vary across genres?
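The last question above, about average performance, is worth separating from totals: mean sales per title controls for how many games a genre ships. A sketch on toy rows (illustrative values chosen so the two rankings disagree):

```python
import pandas as pd

# Illustrative rows; totals can mask per-title performance
games = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Shooter', 'Shooter'],
    'total_sales': [3.0, 0.2, 0.3, 1.5, 1.6],
})

# Total vs average sales, and title count, per genre
genre_perf = games.groupby('genre')['total_sales'].agg(['sum', 'mean', 'count'])
```

In this toy slice, Action leads on `sum` while Shooter leads on `mean`; running the same aggregation on the real frame answers the average-performance question directly.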

Step 4: Regional Market Analysis and User Profiles¶

In this section, we will analyze the gaming market characteristics across three major regions: North America (NA), Europe (EU), and Japan (JP). Our analysis will focus on platform preferences, genre popularity, and the impact of ESRB ratings in each region.

4.1 Regional Platform Analysis¶

Let's begin by examining platform performance across different regions:

In [37]:
# Function to analyze platform performance by region

def regional_sales(data: pd.DataFrame, desired_region: str) -> pd.DataFrame:
    
    selected = data[data['year_of_release'] >= 2011]
    columns = ['platform', desired_region]
    
    result = selected[columns].groupby('platform')[desired_region].sum().reset_index()
    return result.sort_values(by=desired_region, ascending=False)

#jp_user = data[(data['year_of_release'] >= 2011) & (data['jp_sales'] >= 1.0)][['name', 'platform', 'year_of_release',
                                                                               #'genre', 'jp_sales', 'critic_score',
                                                                               #'user_score', 'rating']].sort_values(by='jp_sales')
#display(jp_user)
#jp_user.info()

#data.info()
#display(data)
In [38]:
# Analyze each region

regional_plotly = regional_sales(data, 'na_sales').merge(regional_sales(data, 'eu_sales'), on='platform', how='outer')
regional_plotly = regional_plotly.merge(regional_sales(data, 'jp_sales'), on='platform', how='outer')
regional_plotly = regional_plotly.rename(columns={'platform' : 'Platform', 'na_sales' : 'NA Sales', 'eu_sales' : 'EU Sales', 'jp_sales' : 'JP Sales'})

display(regional_plotly)

#display(regional_sales(data, 'na_sales'), regional_sales(data, 'eu_sales'), regional_sales(data, 'jp_sales'))
Platform NA Sales EU Sales JP Sales
0 X360 226.80 117.10 3.35
1 PS3 168.26 163.10 49.39
2 PS4 108.74 141.09 15.96
3 XOne 93.12 51.59 0.34
4 3DS 82.65 61.27 100.62
5 Wii 48.56 29.60 8.97
6 WiiU 38.10 25.13 13.01
7 PC 29.88 56.70 0.00
8 DS 18.86 9.24 8.04
9 PSV 12.47 13.07 21.84
10 PSP 2.19 1.77 24.04
11 PS2 0.13 0.13 0.09

This is an exciting view into which regions gravitate toward certain platforms. I wasn't aware that the X360 had so many more overall sales in North America, and I'm surprised at the distribution compared to Europe. It seems the X360 and XOne simply do not perform as well in EU and JP as they do in NA. The same can be said about the handhelds ('3DS', 'PSV', 'PSP') and their performance in Japan versus the other regions.
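Normalizing each region's column to percentages makes the comparison independent of overall market size (Japan's market is much smaller in absolute terms). A sketch on a slice of the table above, using the same region values shown there:

```python
import pandas as pd

# Subset of the regional totals shown above
regional = pd.DataFrame({
    'Platform': ['X360', 'PS3', '3DS'],
    'NA Sales': [226.80, 168.26, 82.65],
    'JP Sales': [3.35, 49.39, 100.62],
}).set_index('Platform')

# Each column as a share of that region's total (each column sums to 100)
shares = regional.div(regional.sum(axis=0), axis=1) * 100
```

Within this slice, the '3DS' share of the Japanese market far exceeds its North American share, matching the handheld observation above.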

Cross-Regional Platform Comparison¶

Let's create a comparative analysis of platform performance across regions:

In [39]:
# Create a comparative platform analysis
In [40]:
# Visualize cross-regional comparison for top platforms

fig = px.histogram(regional_plotly, x='Platform', y=['NA Sales', 'EU Sales', 'JP Sales'],
                   barmode='group', height=700, labels={'value' : 'Total Sales (USD Million)', 'variable' : 'Region'},
                   title='Regional Market Analysis on Platforms')

fig.show()

#display(regional_plotly)

4.2 Regional Genre Analysis¶

Now let's examine genre preferences across regions:

In [41]:
# Function to analyze genre performance by region

def regional_genre(data: pd.DataFrame, desired_region: str) -> pd.DataFrame:
    
    selected = data[data['year_of_release'] >= 2011]
    columns = ['genre', desired_region]
    
    result = selected[columns].groupby('genre')[desired_region].sum().reset_index()
    return result.sort_values(by=desired_region, ascending=False)

regional_genre_ex = regional_genre(data, 'na_sales').merge(regional_genre(data, 'eu_sales'), on='genre', how='outer')
regional_genre_ex = regional_genre_ex.merge(regional_genre(data, 'jp_sales'), on='genre', how='outer')
regional_genre_ex = regional_genre_ex.rename(columns={'genre' : 'Genre', 'na_sales' : 'NA Sales', 'eu_sales' : 'EU Sales', 'jp_sales' : 'JP Sales'})

display(regional_genre_ex)
Genre NA Sales EU Sales JP Sales
0 Action 230.92 199.12 63.62
1 Shooter 194.43 148.02 11.92
2 Sports 108.70 88.48 11.09
3 Role-Playing 83.78 62.30 79.87
4 Misc 69.69 39.72 17.60
5 Platform 38.28 30.07 12.13
6 Racing 31.98 40.58 5.64
7 Fighting 31.63 16.24 12.38
8 Simulation 15.37 19.72 11.64
9 Adventure 14.76 13.96 12.07
10 Strategy 7.48 8.02 4.78
11 Puzzle 2.74 3.56 2.91

Cross-Regional Genre Comparison¶

Let's compare genre preferences across regions:

In [42]:
# Create a comparative genre analysis

fig = px.histogram(regional_genre_ex, x='Genre', y=['NA Sales', 'EU Sales', 'JP Sales'], barmode='group',
                   height=600, labels={'variable' : 'Region', 'value' : 'Total Sales (USD Million)'},
                   title='How do Genres Perform in each Market Region?')

fig.show()

4.3 ESRB Rating Impact Analysis¶

Finally, let's examine how ESRB ratings affect sales in each region:

In [43]:
# Function to analyze ESRB rating impact

def regional_rating(data: pd.DataFrame, desired_region: str) -> pd.DataFrame:
    
    selected = data[data['year_of_release'] >= 2011]
    columns = ['rating', desired_region]
    
    result = selected[columns].groupby('rating')[desired_region].sum().reset_index()
    return result.sort_values(by=desired_region, ascending=False)

#data.info()
#display(data)
In [44]:
# Analyze ESRB impact for each region

regional_rating_ex = regional_rating(data, 'na_sales').merge(regional_rating(data, 'eu_sales'), on='rating', how='outer')
regional_rating_ex = regional_rating_ex.merge(regional_rating(data, 'jp_sales'), on='rating', how='outer')
regional_rating_ex = regional_rating_ex.rename(columns={'rating' : 'Rating', 'na_sales' : 'NA Sales', 'eu_sales' : 'EU Sales', 'jp_sales' : 'JP Sales'})

display(regional_rating_ex)

fig = px.histogram(regional_rating_ex, x='Rating', y=['NA Sales', 'EU Sales', 'JP Sales'],
                   barmode='group', height=600, labels={'variable' : 'Region', 'value' : 'Total Sales (USD Million)'},
                   title='What Impact Does Rating Have on Sales?')

fig.show()
Rating NA Sales EU Sales JP Sales
0 M 308.45 250.26 26.14
1 E 170.17 141.73 36.68
2 E10+ 117.72 78.00 11.23
3 T 109.31 83.16 33.39
4 EC 0.42 0.00 0.00
5 RP 0.00 0.03 0.00

*People like a Mature game.*

'Grand Theft Auto V' & 'Call of Duty' really rocked the gaming market in favor of Action and Shooter games rated M. Japan, by contrast, consumes substantially less 'M'/'Mature' content, and even consumes more titles rated 'E'/'Everyone'. This also makes sense, since many handheld titles are rated 'E'/'Everyone', and Japan leads in handheld platform consumption.
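The Japan difference can be quantified the same way as the platform shares, as each rating's share of its region's total. A sketch on a slice of the rating table above:

```python
import pandas as pd

# Subset of the ESRB totals shown above
ratings = pd.DataFrame({
    'Rating': ['M', 'E'],
    'NA Sales': [308.45, 170.17],
    'JP Sales': [26.14, 36.68],
}).set_index('Rating')

# Each rating as a percentage of the region's total in this slice
rating_share = ratings.div(ratings.sum(axis=0), axis=1) * 100
```

Even in this two-row slice, 'E' outsells 'M' in Japan while the reverse holds in North America.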

Reviewer's comment v1

Well done!

Step 5 : Hypothesis Tests¶

—Average user ratings of the Xbox One and PC platforms are the same.

—Average user ratings for the Action and Sports genres are different.

Set the alpha threshold value yourself.

Explain:

—How you formulated the null and alternative hypotheses

—What criteria you used to test the hypotheses, and why

In [45]:
        #1.Average user ratings of the XOne == PC

#Null-Hypothesis: XOne['user_rating'].mean() == PC['user_rating'].mean()
#Alternate-Hypothesis: XOne['user_rating'].mean() != PC['user_rating'].mean()

xone_avg = data[(data['year_of_release'] >= 2011) & (data['platform'] == 'XOne')]['user_score'].dropna()
pc_avg = data[(data['year_of_release'] >= 2011) & (data['platform'] == 'PC')]['user_score'].dropna()

alpha = .05

results = st.ttest_ind(xone_avg, pc_avg, equal_var=False)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print('We reject the Null Hypothesis')
else:
    print('We cannot reject the Null Hypothesis')
p-value:  0.6130712247638477
We cannot reject the Null Hypothesis

*Null-Hypothesis: XOne['user_rating'].mean() == PC['user_rating'].mean()*

*Alternate-Hypothesis: XOne['user_rating'].mean() != PC['user_rating'].mean()*

With a statistical significance level of 5%, or .05, there is not sufficient evidence to reject the null hypothesis. We cannot conclude that the average user ratings of the XOne and PC platforms differ; note that failing to reject the null does not prove the means are equal, only that the data are consistent with equality.

I chose to use 2 Series objects containing the user scores of all games from each platform, staying true to our relevant time period of 2011 - 2016. Welch's t-test (equal_var=False) was used because the two statistical populations (XOne vs PC) cannot be assumed to share the same variance.
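The test applied here, `st.ttest_ind(..., equal_var=False)`, is Welch's two-sample t-test, which drops the equal-variance assumption. A self-contained sketch on synthetic samples (the sample sizes, means, and spreads below are invented for illustration):

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)

# Two synthetic 'user score' samples with unequal variances and means
sample_a = rng.normal(loc=7.0, scale=1.0, size=200)
sample_b = rng.normal(loc=5.0, scale=2.0, size=150)

alpha = 0.05
result = st.ttest_ind(sample_a, sample_b, equal_var=False)  # Welch's t-test
reject = result.pvalue < alpha
```

With a true mean difference of 2 points and these sample sizes, the test rejects the null comfortably; with equal true means it usually would not.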

In [46]:
        #2. Average user ratings of Action != Sports

#Null-Hypothesis: Action['user_rating'].mean() == Sports['user_rating'].mean()
#Alternate-Hypothesis: Action['user_rating'].mean() != Sports['user_rating'].mean()

action_avg = data[(data['year_of_release'] >= 2011) & (data['genre'] == 'Action')]['user_score'].dropna()
sports_avg = data[(data['year_of_release'] >= 2011) & (data['genre'] == 'Sports')]['user_score'].dropna()

alpha = .05

results = st.ttest_ind(action_avg, sports_avg, equal_var=False)

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print('We reject the Null Hypothesis')
else:
    print('We cannot reject the Null Hypothesis')

#display(action_avg, sports_avg)
#data.info()
#display(data)
p-value:  8.285648549432565e-19
We reject the Null Hypothesis

*Null-Hypothesis: Action['user_rating'].mean() == Sports['user_rating'].mean()*

*Alternate-Hypothesis: Action['user_rating'].mean() != Sports['user_rating'].mean()*

With a statistical significance level of 5%, or .05, there is sufficient evidence to reject the null hypothesis. We conclude that the average user ratings of the Action and Sports genres are different. Note that the null hypothesis is always the equality statement; the claim under test, that the ratings differ, is the alternative.

I chose to use 2 Series objects containing the user scores of all games from each genre ('Action' & 'Sports'), staying true to our relevant time period of 2011 - 2016. Welch's t-test (equal_var=False) was used because the two statistical populations ('Action' vs 'Sports') cannot be assumed to share the same variance.

Reviewer's comment v1

Well done!

Step 6. Write a general conclusion¶

*Remarkable observations:*

1. Many titles are missing some combination of [user/critic/rating], with no observable pattern.

2. 'PC' is a veteran, emerging in 1985, and staying consistent through time into the modern era (2011 - 2016)

3. 'PC' can almost be considered the main artery in which the others branch from, and no generational successor in sight due to missing hardware 'constraints'. A constantly evolving platform, while time is a concept only to the others.

4. 'PlayStation' and 'XBox' would spear-head a surge into the gaming market, with 'Playstation' seeing 100 Million USD in sales within 4 years of launching, and 'XBox' within 8 years.

5. Our relevant time period (2011 - 2016) was selected based on the observed trajectory in the total number of titles being released, which inevitably impacts the rate at which developers make profit. Probable causes include development time demands, the scale of projects, etc.

6. Our data represents the 'passing of the torch', or a period in time where titles are being released on generationally different hardware ['PS3' -> 'PS4'], maybe as a way to give users the time to make the hardware change, a grace period if you will.

7. Handhelds and consoles follow similar trends; in the interesting case of the '3DS', sales increased once titles stopped releasing for the previous generation 'DS'. A probable cause is the removal of hardware/software constraints between generations: innovative ideas bring increased sales. We have yet to see whether the same holds for the consoles, though it seems likely.

8. 'PS3' and 'X360' struck gold with the release of 'Grand Theft Auto V' and 'Call of Duty', setting the Action & Shooter genres to be leagues ahead of others in sales.

9. Japan consumes significantly less 'M'/'Mature'-rated content than the other regions (North America, Europe). Japan even consumes more content rated 'E'/'Everyone', which correlates with the '3DS' handheld reaching 100 Million USD in title sales there, doubling the runner-up 'PS3'.

10. Action and Shooter titles have gapped the other genres with similar distributions between both North America and Europe, with Role-Playing as Japan's highest selling genre.

11. 'Mature' games have a significant market lead, with 'Everyone' trailing at roughly 55% of 'Mature's market share.

12. The average user ratings of XOne and PC are statistically indistinguishable, while the average user ratings of Action and Sports differ.

Reviewer's comment v1

You did an excellent job analyzing the data and drawing conclusions!

It’s great that you pay attention to formulating interim conclusions in your project-this is an excellent practice that will benefit you in future projects. Keep using this approach, as it makes your analysis more structured and easier for readers and colleagues to understand.

Additionally, it’s fantastic that you included valuable insights about trends in the gaming industry. You did a great job highlighting key points and interpreting them clearly.

Overall, excellent work! Keep it up! 😊

In [ ]: